Abstract

We present improved methods of using structured SVMs in a large-scale hierarchical classification problem, that is, when labels are leaves, or sets of leaves, in a tree or a DAG. We examine the need to normalize both the regularization and the margin and show how doing so significantly improves performance, including achieving state-of-the-art results where unnormalized structured SVMs do not perform better than flat models. We also describe a further extension of hierarchical SVMs that highlights the connection between hierarchical SVMs and matrix factorization models.


1. Introduction 

We consider the problem of hierarchical classification, that is, a classification problem in which the labels are leaves in a large hierarchy or taxonomy specifying the relationship between labels. Such hierarchies have been extensively used to improve accuracy (McCallum et al., 1998; Silla and Freitas, 2011; Vural and Dy, 2004) in domains such as document categorization (Cai and Hofmann, 2004), web content classification (Dumais and Chen, 2000), and image annotation (Huang et al., 1998). In some problems, taking advantage of the hierarchy is essential since each individual label (a leaf in the hierarchy) might have only a few training examples associated with it.

We focus on hierarchical SVM (Cai and Hofmann, 2004), which is a structured SVM problem with the structure specified by the given hierarchy. Structured SVMs are simple compared to other hierarchical classification methods, and yield convex optimization problems with straightforward gradients. However, as we shall see, adapting structured SVMs to large-scale hierarchical problems can be problematic and requires care. We will demonstrate that "standard" hierarchical SVM suffers from several deficiencies, mostly related to a lack of normalization with respect to different path lengths and different label sizes in multi-label problems. These deficiencies can result in poor performance, possibly providing no improvement over a "flat" method which ignores the hierarchy. To amend these problems, we present the Normalized Hierarchical SVM (NHSVM). The NHSVM is based on normalization weights which we set according to the hierarchy, but not based on the data. We then go one step further and learn these normalization weights discriminatively. Beyond improved performance, this results in a model that can be viewed as a constrained matrix factorization for multi-class classification, and allows us to understand the relationship between hierarchical SVMs and matrix-factorization based multi-class learning (Amit et al., 2007).

We also extend hierarchical SVMs to issues frequently encountered in practice, such as multi-label problems (each document might be labeled with several leaves) and taxonomies that are DAGs rather than trees.

We present a scalable training approach and apply our methods to large-scale problems, with up to hundreds of thousands of labels and tens of millions of instances, obtaining significant improvements over standard hierarchical SVMs and state-of-the-art results on a hierarchical classification benchmark.

2. Related Work 

Hierarchical multi-class and multi-label classification has been studied extensively. Our work differs from other methods in the normalization of the structure, the scalability of the optimization, and the utilization of the existing label structure.

Our work builds on hierarchical classification using SVMs, introduced in Cai and Hofmann (2004). The model extends the multi-class SVM to a hierarchical structure. An extension to the multi-label case was presented by Cai and Hofmann (2007). Rousu et al. (2006) present an efficient dual optimization method for a kernel-based structural SVM, together with weighted decomposable losses, for a tree-structured multi-label problem. These methods focus on dual optimization, which does not scale to the datasets we target, with large numbers of instances and labels. Also, the previous methods do not consider normalization of the structure, which is important for such large structures.

For instance, we focus on the Wikipedia dataset in the Large Scale Hierarchical Text Classification Competition (LSHTC).¹ It has 400K instances with a bag-of-words representation of Wikipedia pages, each multi-labeled with its categories. The labels are the leaves of a DAG structure with 65K nodes. This scale is very large compared to the datasets considered by the previously mentioned methods; for instance, the largest dataset in Rousu et al. (2006) has 7K instances and 233 nodes. Extensions of KNN, meta-learning, and ensemble methods were popular in the competition.

Gopal and Yang (2013) presented a model with a multi-task objective and an efficient parallelizable optimization method for datasets with a large structure and a large number of instances. However, its regularization suffers from the same normalization issue, and it relies on another meta-learning method (Gopal and Yang, 2010) in post-processing for high accuracy in multi-label problems.

There are alternatives to SVM approaches (Weinberger and Chapelle, 2009; Vural and Dy, 2004; Cesa-Bianchi et al., 2006); however, these approaches do not scale to large datasets with large structures.

Another direction is to learn the structure rather than utilizing a given structure. Bravo et al. (2009) and Blaschko et al. (2013) focus on learning a small structure from the data, which is very different from using a known structure. A fast ranking method (Prabhu and Varma, 2014) has been proposed for large datasets. It builds a tree structure for ranking labels. However, it does not utilize a given hierarchy, and is not directly a multi-label classifier.

3. Preliminaries 

Let $\mathcal{G}$ be a tree or a directed acyclic graph (DAG) representing a label structure with $M$ nodes. Denote the set of leaf nodes in $\mathcal{G}$ as $\mathcal{L}$. For each $n \in [M]$, define the sets of parent, children, ancestor, and descendant nodes of $n$ as $\mathcal{P}(n)$, $\mathcal{C}(n)$, $\mathcal{A}(n)$, and $\mathcal{D}(n)$, respectively. Additionally, denote the ancestor nodes of $n$ including node $n$ as $\bar{\mathcal{A}}(n) = \{n\} \cup \mathcal{A}(n)$, and similarly, denote $\bar{\mathcal{D}}(n) = \{n\} \cup \mathcal{D}(n)$. We also extend the notation above for sets of nodes to indicate the union of the corresponding sets, i.e., $\mathcal{P}(A) = \bigcup_{n \in A} \mathcal{P}(n)$.

¹ http://lshtc.iit.demokritos.gr/

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be the training data of $N$ instances. Each $x_i \in \mathbb{R}^d$ is a feature vector and it is labeled with either a leaf (in single-label problems) or a set of leaves (in multi-label problems) of $\mathcal{G}$. We will represent the labels $y_i$ as subsets of the nodes of the graph, where we include the indicated leaves and all their ancestors. That is, the label space (set of possible labels) is $\mathcal{Y}_s = \{\bar{\mathcal{A}}(l) \mid l \in \mathcal{L}\}$ for single-label problems, and $\mathcal{Y}_m = \{\bar{\mathcal{A}}(L) \mid L \subseteq \mathcal{L}\}$ for multi-label problems.
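The label representation above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the paper: nodes are numbered from 0, the hierarchy is given as a hypothetical `parents` dictionary, and the helper name `ancestors_inclusive` is made up here to play the role of $\bar{\mathcal{A}}(\cdot)$.

```python
def ancestors_inclusive(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    label, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in label:
            label.add(n)
            # Follow every parent, so DAG nodes with several parents work too.
            stack.extend(parents.get(n, []))
    return frozenset(label)


# Hypothetical 5-node DAG: node 0 is the root and leaf 3 has two parents.
parents = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [2]}
single_label = ancestors_inclusive(parents, [4])     # an element of Y_s
multi_label = ancestors_inclusive(parents, [3, 4])   # an element of Y_m
```

Representing each label as a frozen set of nodes keeps later set operations (membership tests, symmetric differences) direct.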

4. Hierarchical Structured SVM 

We review the hierarchical structured SVM introduced in Cai and Hofmann (2004) and extended to the multi-label case in Cai and Hofmann (2007). Consider $W \in \mathbb{R}^{M \times d}$, and let the $n$-th row vector $W_n$ be the weights of the node $n \in [M]$. Define $\gamma(x, y)$ to be the potential of label $y$ given feature $x$, which is the sum of the inner products of $x$ with the weights of the nodes $n \in y$, $\gamma(x, y) = \sum_{n \in y} W_n \cdot x$. If we vectorize $W$, $w = \mathrm{vec}(W) = [W_1^T\ W_2^T\ \cdots\ W_M^T]^T \in \mathbb{R}^{d \cdot M}$, and define the class-attribute $\Lambda(y) \in \mathbb{R}^M$, $[\Lambda(y)]_n = 1$ if $n \in y$ or 0 otherwise², then

$$\gamma(x, y) = \sum_{n \in y} W_n \cdot x = w \cdot (\Lambda(y) \otimes x) \qquad (1)$$

where $\otimes$ is the Kronecker product. With weights $W_n$, prediction of an instance $x$ amounts to finding the maximum response label

$$\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} \gamma(x, y) = \arg\max_{y \in \mathcal{Y}} \sum_{n \in y} W_n x$$

Given a structural error $\Delta(y', y)$, for instance the hamming distance $\Delta^H(y', y) = |y' - y| = \sum_{n \in [M]} |1_{n \in y'} - 1_{n \in y}|$, training a hierarchical structured SVM amounts to optimizing:

$$\min_{W} \; \lambda \sum_{n} \|W_n\|_2^2 + \sum_{i} \max_{y \in \mathcal{Y}} \Big\{ \sum_{n \in y} W_n x_i - \sum_{n \in y_i} W_n x_i + \Delta(y, y_i) \Big\} \qquad (2)$$
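As a small numerical check of identity (1) and of the prediction rule, the following sketch computes the potential both as a sum over nodes and through the Kronecker form. It is an illustration under assumptions of our own: nodes are 0-indexed, labels are Python sets, and all function names are hypothetical.

```python
def potential(W, x, y):
    """gamma(x, y): sum of inner products W_n . x over the nodes n in y."""
    return sum(sum(w * t for w, t in zip(W[n], x)) for n in y)


def class_attribute(y, M):
    """Lambda(y): indicator vector of the nodes contained in label y."""
    return [1.0 if n in y else 0.0 for n in range(M)]


def kron(a, x):
    """Kronecker product of two vectors, flattened row by row."""
    return [a_n * x_j for a_n in a for x_j in x]


def vec(W):
    """Stack the rows of W into one long vector w."""
    return [w for row in W for w in row]


def hamming(y1, y2):
    """Hamming distance between two labels (size of symmetric difference)."""
    return len(y1 ^ y2)


def predict(W, x, label_space):
    """Maximum-response label, found by enumerating the label space."""
    return max(label_space, key=lambda y: potential(W, x, y))


W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # M = 3 nodes, d = 2 features
x = [3.0, 4.0]
y = {0, 2}
kron_form = sum(a * b for a, b in zip(vec(W), kron(class_attribute(y, 3), x)))
```

Enumerating the label space, as `predict` does, is only practical for single-label problems; Section 7.1 discusses the multi-label case.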


Equivalently, in terms of $w$ and the class-attribute $\Lambda(y)$,

$$\min_{w} \; \lambda \|w\|_2^2 + \sum_{i} \max_{y \in \mathcal{Y}} \big\{ w \cdot ((\Lambda(y) - \Lambda(y_i)) \otimes x_i) + \Delta(y, y_i) \big\} \qquad (3)$$

² The class attributes could be variables, but are used only as fixed constants for mathematical convenience, without detailed discussion, and are not used for normalization of the structure.

5. Normalized Hierarchical SVM

A major issue we highlight is that unbalanced structures (which are frequently encountered in practice) lead to non-uniform regularization with the standard hierarchical SVM. To illustrate this issue, consider the two binary tree structures with two leaves shown in Figure 1. Implicitly, both structures describe the same structure. Recall that the regularization penalty is $\|W\|_F^2 = \sum_n \|W_n\|_2^2$, where each row of $W$ is a weight vector for each node. In the left structure, the class attributes are $\Lambda(y_1) = [1\ 0]^T$ and $\Lambda(y_2) = [0\ 1]^T$; assume $\|x\|_2 = 1$, and let the optimal weights of node 1 and node 2 in the left structure be $W_1^*$ and $W_2^*$. Now add a node 3 as a child of node 1, so that $M = 3$, $\Lambda(y_1) = [1\ 0\ 1]^T$, $\Lambda(y_2) = [0\ 1\ 0]^T$. Let $W_1'$ and $W_3'$ be the new weights for the nodes 1 and 3. If we assume $W_1' = W_3' = \frac{1}{2} W_1^*$, the potential function, and thus the decision boundary, remain the same, but the regularization penalty for $y_1$ is halved, so that $\|W_1'\|_2^2 + \|W_3'\|_2^2 = \frac{1}{2}\|W_1^*\|_2^2$, and $\|W^*\|_F^2 > \|W'\|_F^2$. This can be generalized to any depth, and the regularization penalty can differ arbitrarily for models with the same decision boundary for different structures. In the given example, the structure on the right imposes half the penalty for the predictor of $y_1$ compared to that of $y_2$.

Figure 1. The regularization penalty for label $y_1$ (left branch) is halved without changing the decision boundary, due to a difference in the label structure.

The issue can also be understood in terms of the difference between the norms of $\Lambda(y)$ for $y \in \mathcal{Y}$. Let $\phi(x, y) \in \mathbb{R}^{d \cdot M}$ be the feature map for an instance vector $x$ and a label $y$ such that $\gamma(x, y) = w \cdot \phi(x, y)$. From (1),

$$w \cdot (\Lambda(y) \otimes x) = w \cdot \phi(x, y)$$

so $\Lambda(y) \otimes x$ behaves as a feature map in the hierarchical structured SVM. While the model regularizes $w$, the norm of $\phi(x, y)$ is different for each $y$ and scales as $\|\Lambda(y)\|_2$:

$$\|\phi(x, y)\|_2 = \|\Lambda(y) \otimes x\|_2 = \|\Lambda(y)\|_2 \cdot \|x\|_2$$

Note that $\|\Lambda(y)\|_2 = \sqrt{|\bar{\mathcal{A}}(y)|}$, and the differences in regularization can grow linearly with the depth of the structure.

To remedy this effect, for each node $n$ we introduce a weight $\alpha_n \ge 0$ such that the sum of the weights along each path to a leaf is one, i.e.,

$$\sum_{n \in \bar{\mathcal{A}}(l)} \alpha_n = 1, \quad \forall l \in \mathcal{L}. \qquad (4)$$

Given such weights, we define the normalized class-attribute $\tilde{\Lambda}(y) \in \mathbb{R}^M$ and the normalized feature map $\tilde{\phi}(x, y) \in \mathbb{R}^{d \cdot M}$,

$$[\tilde{\Lambda}(y)]_n = \begin{cases} \sqrt{\alpha_n} & \text{if } n \in y \\ 0 & \text{otherwise} \end{cases}, \qquad \tilde{\phi}(x, y) = \tilde{\Lambda}(y) \otimes x \qquad (5)$$

The norms of these vectors are normalized to 1, independent of $y$, i.e., $\|\tilde{\Lambda}(y)\|_2 = 1$ and $\|\tilde{\phi}(x, y)\|_2 = \|x\|_2$ for $y \in \mathcal{Y}_s$, and the class attribute for each node $n$ is fixed to 0 or $\sqrt{\alpha_n}$ for all labels. The choice of $\alpha$ is crucial and we present several alternatives (in our experiments, we choose between them using a hold-out set). For instance, using $\alpha_n = 1$ on the leaves $n \in \mathcal{L}$ and 0 otherwise will recover the flat model and lose all the information in the hierarchy. To refrain from having a large number of zero weights and to preserve the information in the hierarchy, we consider setting $\alpha$ by optimizing:

$$\begin{aligned} \min_{\alpha} \;& \sum_{n} \alpha_n^{\rho} \\ \text{s.t.} \;& \sum_{n \in \bar{\mathcal{A}}(l)} \alpha_n = 1, && \forall l \in \mathcal{L} \\ & \alpha_n \ge 0, && \forall n \in [M] \end{aligned} \qquad (6)$$

where $\rho > 1$. In Section 5.2, we will show that as $\rho \to 1$, we obtain weights that remedy the effect of the redundant nodes shown in Figure 1.

We use (6) with $\rho = 2$ as a possible way of setting the weights. However, when $\rho = 1$, the optimization problem (6) is no longer strongly convex and it is possible to recover weights of zero for most nodes. Instead, for $\rho = 1$, we consider the alternative optimization for selecting weights:

$$\begin{aligned} \max_{\alpha} \;& \min_{n} \alpha_n \\ \text{s.t.} \;& \sum_{n \in \bar{\mathcal{A}}(l)} \alpha_n = 1, && \forall l \in \mathcal{L} \\ & \alpha_n \ge 0, && \forall n \in [M] \\ & \alpha_n \ge \alpha_p, && \forall n \in [M], \forall p \in \mathcal{P}(n) \end{aligned} \qquad (7)$$

We refer to the last constraint as a "directional constraint", as it encourages more of the information to be carried by the leaves and results in a more even distribution of $\alpha$.
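To make the effect of the weights concrete, the sketch below evaluates the class-attribute norms for the three-node structure of Figure 1, with and without normalization. The weights used are one hand-picked choice satisfying (4), not the output of (6) or (7); nodes are 0-indexed and all names are hypothetical.

```python
import math


def class_attribute(y, M, alpha=None):
    """Class attribute of label y: unnormalized if alpha is None,
    otherwise sqrt(alpha_n) on the nodes of y as in (5)."""
    return [(1.0 if alpha is None else math.sqrt(alpha[n])) if n in y else 0.0
            for n in range(M)]


def l2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in v))


# Right-hand structure of Figure 1 with 0-indexed nodes:
# node 2 is the child of node 0, node 1 is the other leaf.
M = 3
y1, y2 = {0, 2}, {1}
alpha = [0.5, 1.0, 0.5]   # assumed weights; each path to a leaf sums to 1

unnormalized = (l2(class_attribute(y1, M)), l2(class_attribute(y2, M)))
normalized = (l2(class_attribute(y1, M, alpha)),
              l2(class_attribute(y2, M, alpha)))
```

Without the weights the two labels have norms $\sqrt{2}$ and 1, while with them both norms equal 1, which is the property stated after (5).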
For some DAG structures, constraining the sum $\sum_{n \in \bar{\mathcal{A}}(l)} \alpha_n$ to be exactly one can result in a very flat solution. For DAG structures we therefore relax the constraint to

$$1 \le \sum_{n \in \bar{\mathcal{A}}(l)} \alpha_n \le T, \quad \forall l \in \mathcal{L}, \qquad (8)$$

for some parameter $T$ ($T = 1.5$ in our experiments).

Another source of the imbalance is the non-uniformity of the required margin, which results from the norm of the differences of class-attributes, $\|\Lambda(y) - \Lambda(y')\|_2$. The loss term of each instance in (3) is $\max_{y \in \mathcal{Y}} w \cdot ((\Lambda(y) - \Lambda(y_i)) \otimes x_i) + \Delta(y, y_i)$. To have a zero loss, for all $y \in \mathcal{Y}$,

$$\Delta(y, y_i) \le w \cdot ((\Lambda(y_i) - \Lambda(y)) \otimes x_i)$$

$\Delta(y, y_i)$ works as the margin requirement to have a zero loss for $y$. The right-hand side of the bound scales as the norm of $\Lambda(y) - \Lambda(y_i)$ scales.

This calls for the use of a structural error that scales with the bound. Define the normalized structural error $\tilde{\Delta}(y, y_i)$ as

$$\tilde{\Delta}(y, y_i) = \|\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)\|_2 = \sqrt{\sum_{n \in y \triangle y_i} \alpha_n} \qquad (9)$$

where $y \triangle y_i = (y_i - y) \cup (y - y_i)$, and $\tilde{\Lambda}(y)$ and $\alpha$ are defined in (5) and (6). Without the normalization, this is the square root of the hamming distance, and is similar to a tree induced distance in (Dekel et al., 2004). This view of nonuniform margin gives a justification that the square root of the hamming distance or the tree induced distance is preferable to the hamming distance.

5.1. Normalized Hierarchical SVM model

Summarizing the above discussion, we propose the Normalized Hierarchical SVM (NHSVM), which is given in terms of the following objective:

$$\min_{W} \; \lambda \sum_{n} \|W_n\|_2^2 + \sum_{i} \max_{y \in \mathcal{Y}} \Big( \sum_{n \in y} \sqrt{\alpha_n} W_n x_i - \sum_{n \in y_i} \sqrt{\alpha_n} W_n x_i + \tilde{\Delta}(y, y_i) \Big) \qquad (10)$$

Instead of imposing a weight for each node, with the change of variables $U_n = \sqrt{\alpha_n} W_n$, we can write optimization (10) as changing the regularization:

$$\min_{U} \; \lambda \sum_{n} \frac{\|U_n\|_2^2}{\alpha_n} + \sum_{i} \max_{y \in \mathcal{Y}} \Big( \sum_{n \in y} U_n x_i - \sum_{n \in y_i} U_n x_i + \tilde{\Delta}(y, y_i) \Big) \qquad (11)$$

Also, optimization (10) is equivalently written as

$$\min_{w} \; \lambda \|w\|_2^2 + \sum_{i} \max_{y \in \mathcal{Y}} w \cdot ((\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)) \otimes x_i) + \tilde{\Delta}(y, y_i) \qquad (12)$$

Note that for the single-label problem, the normalized hierarchical SVM can be viewed as a multi-class SVM with the feature map changed to (5) and the loss term to (9). Therefore, it can be easily applied to problems where a flat SVM is used, and popular optimization methods for SVMs, such as Shalev-Shwartz et al. (2007) and Lacoste-Julien et al. (2013), can be used.

Another possible variant of optimization (11) which we experiment with is obtained by dividing inside the max by $\|\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)\|_2$:

$$\min_{w} \; \lambda \|w\|_2^2 + \sum_{i} \max_{y \in \mathcal{Y}} w \cdot \Big( \frac{\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)}{\|\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)\|_2} \otimes x_i \Big) + 1 \qquad (13)$$

There are two interesting properties of the optimization (13). The norm of the vector on the right side of $w$ is normalized, i.e.,

$$\Big\| \frac{\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)}{\|\tilde{\Lambda}(y) - \tilde{\Lambda}(y_i)\|_2} \otimes x_i \Big\|_2 = \|x_i\|_2$$

Also, the loss term per instance at the decision boundary, which is also the required margin, is normalized to 1. However, because the normalized class attribute in (13) does not decompose with respect to nodes as in (10), loss augmented inference in (13) is not efficient for multi-label problems.

5.2. Invariance property of the normalized hierarchical SVM

As we saw in Figure 1, different hierarchical structures can be used to describe the same data, and this causes undesired regularization problems. This is a common problem in real-world datasets. For instance, an action movie label can be further categorized into a cop-action movie and a hero-action movie in one dataset, whereas another dataset uses action movie as a label. Therefore, it is desirable for the learning method of a hierarchical model to adapt to this difference and learn a similar model if the given datasets describe similar data. The proposed normalization can be viewed as an adaptation to this kind of distortion. In particular, we show that NHSVM is invariant to node duplication.

Define duplicated nodes as follows. Assume that there are no unseen nodes in the dataset, i.e., $\forall n \in [M], \exists i, n \in y_i$. Define two nodes $n_1$ and $n_2$ in $[M]$ to be duplicated if $\forall i$, $n_1 \in y_i \iff n_2 \in y_i$. Define the minimal graph $M(\mathcal{G})$ to be the graph having a representative node per each duplicated node set, obtained by merging each duplicated node set into a node. For the proof, see Appendix A.

Theorem 1 (Invariance property of NHSVM). The decision boundary of NHSVM with $\mathcal{G}$ is arbitrarily close to that of NHSVM with the minimal graph $M(\mathcal{G})$ as $\rho$ in (6) approaches 1, $\rho > 1$.
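The normalized structural error of (9) can be checked numerically. The sketch below, on a hypothetical five-node tree with hand-picked weights satisfying (4), compares the closed form over the symmetric difference with the norm of the difference of normalized class attributes; the names and the 0-indexing are assumptions made for illustration.

```python
import math


def normalized_error(y, y_i, alpha):
    """Closed form of (9): sqrt of the summed weights over the symmetric difference."""
    return math.sqrt(sum(alpha[n] for n in y ^ y_i))


def normalized_attribute(y, alpha):
    """Normalized class attribute of (5)."""
    return [math.sqrt(a) if n in y else 0.0 for n, a in enumerate(alpha)]


def attribute_distance(y, y_i, alpha):
    """Euclidean distance between the normalized class attributes."""
    a, b = normalized_attribute(y, alpha), normalized_attribute(y_i, alpha)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))


# Hypothetical tree: root 0 with children 1 and 2; node 1 has leaves 3 and 4.
alpha = [0.2, 0.3, 0.8, 0.5, 0.5]          # every root-to-leaf path sums to 1
siblings = ({0, 1, 3}, {0, 1, 4})          # labels sharing most of their path
distant = ({0, 2}, {0, 1, 3})              # labels sharing only the root
```

With these weights the sibling leaves are at distance 1 and the distant pair at $\sqrt{1.6}$, whereas the plain hamming distance would give 2 and 3.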


6. Shared SVM: Learning with Shared Frobenius norm

In the NHSVM, we set the weights $\alpha$ based on the graphical structure of the hierarchy, but disregard the data itself. We presented several options for setting the weights, but it is not clear what the best setting would be, or whether a different setting altogether would be preferable. Instead, here we consider discriminatively learning the weights from the data by optimizing a joint objective over the weights and the predictors. The resulting optimization is equivalent to regularization with a new norm, which we call the Structured Shared Frobenius norm or Structured Shared norm. It explicitly incorporates the information of the label structure $\mathcal{G}$. Regularization with the structured shared Frobenius norm promotes models that utilize shared information; thus it is a complexity measure suitable for structured learning. Notice that we only consider the multi-class problem in this section. An efficient algorithm for tree structures is discussed in Section 7.

Consider the formulation (11) as a joint optimization over both $\alpha$ and $U = [U_1^T\ U_2^T\ \cdots\ U_M^T]^T$ with fixed $\tilde{\Delta}(y, y_i) = \Delta(l, l_i)$ (i.e., we no longer normalize the margins, only the regularization):

$$\begin{aligned} \min_{U, \alpha} \;& \lambda \sum_{n} \frac{\|U_n\|_2^2}{\alpha_n} + \sum_{i} \max_{l \in [Y]} \Big( \sum_{n \in \bar{\mathcal{A}}(l)} U_n x_i - \sum_{n \in \bar{\mathcal{A}}(l_i)} U_n x_i + \Delta(l, l_i) \Big) \\ \text{s.t.} \;& \sum_{n \in \bar{\mathcal{A}}(l)} \alpha_n \le 1, && \forall l \in [Y] \\ & \alpha_n \ge 0, && \forall n \in [M] \end{aligned} \qquad (14)$$

We can think of the first term as a regularization norm $\|\cdot\|_{s,\mathcal{G}}$ and write

$$\min_{U} \; \lambda \|U\|_{s,\mathcal{G}}^2 + \sum_{i} \max_{l \in [Y]} (U_l - U_{l_i}) \cdot x_i + \Delta(l, l_i) \qquad (15)$$

where the rows of $U$ are now indexed by labels, and the structured shared Frobenius norm $\|\cdot\|_{s,\mathcal{G}}$ is defined as:

$$\begin{aligned} \|U\|_{s,\mathcal{G}} = \min_{\alpha \in \mathbb{R}^M,\, V \in \mathbb{R}^{M \times d}} \;& \|A\|_{2 \to \infty} \|V\|_F \\ \text{s.t.} \;& AV = U \\ & A_{l,n} = \begin{cases} \sqrt{\alpha_n} & n \in \bar{\mathcal{A}}(l) \\ 0 & \text{otherwise} \end{cases} && \forall l, \forall n \\ & \alpha_n \ge 0, && \forall n \in [M] \end{aligned} \qquad (16)$$

where $\|A\|_{2 \to \infty}$ is the maximum of the $\ell_2$ norms of the row vectors of $A$. Row vectors of $A$ can be viewed as coefficient vectors, and row vectors of $V$ as factor vectors which decompose the matrix $U$. The factorization is constrained, though, and must represent the prescribed hierarchy. We will refer to (15) as the Shared SVM or SSVM.

To better understand the SSVM, we can also define the Shared Frobenius norm without the structural constraint as

$$\|U\|_s = \min_{AV = U} \|A\|_{2 \to \infty} \|V\|_F \qquad (17)$$

The Shared Frobenius norm is a norm between the trace norm (aka nuclear norm) and the max-norm (aka $\gamma_2 : 1 \to \infty$ norm), and is upper bounded by the Frobenius norm:

Theorem 2. For all $U \in \mathbb{R}^{r \times c}$,

$$\frac{1}{\sqrt{rc}} \|U\|_* \le \frac{1}{\sqrt{c}} \|U\|_s \le \|U\|_{\max}$$

$$\|U\|_s \le \|U\|_{s,\mathcal{G}} \le \|U\|_F$$

where $\|U\|_* = \min_{AW^T = U} \|A\|_F \|W\|_F$ is the trace norm, and $\|U\|_{\max} = \min_{AW^T = U} \|A\|_{2 \to \infty} \|W\|_{2 \to \infty}$ is the so-called max norm (Srebro and Shraibman, 2005).

Proof. The first inequality follows from the fact that $\frac{1}{\sqrt{r}} \|U\|_F \le \|U\|_{2 \to \infty}$, and the second inequality is from taking $A = I$, or $A_{l,n} = 1$ when $n$ is a unique node for $l$ and 0 for all other nodes in (16), respectively. □

We compare the Shared norm to the other norms to illustrate its behavior, and summarize the comparison in Table 1. The Shared norm is upper bounded by the Frobenius norm, and drops below it only if sharing the factors $V$ is beneficial. If there is no reduction from sharing, as in the disjoint feature case in Table 1, it equals the Frobenius norm, which is the norm used for multi-class SVM. This justifies the view that SSVM extends multi-class SVM to shared structure, i.e., SSVM is equivalent to multi-class SVM if no sharing of weights is beneficial. This differs from the trace norm, which we can see specifically in the disjoint feature case.
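                  | $\|U\|_s$ | $\|U\|_{s,\mathcal{G}}$ | $\|U\|_F$ | $\|U\|_*$
Full sharing      | $\|u\|_2$ | $\|u\|_2$ | $\sqrt{Y}\|u\|_2$ | $\sqrt{Y}\|u\|_2$
No sharing        |  | $\|U\|_F$ | $\|U\|_F$ |
Disjoint feature  | $\sqrt{\sum_l \|u_l\|_2^2}$ | $\sqrt{\sum_l \|u_l\|_2^2}$ | $\sqrt{\sum_l \|u_l\|_2^2}$ | $\sum_l \|u_l\|_2$
Factor scaling    | $\max_l |a_l| \, \|u\|_2$ |  | $\sqrt{\sum_l a_l^2} \, \|u\|_2$ |

Table 1. Comparing $\|U\|_s$, $\|U\|_{s,\mathcal{G}}$, $\|U\|_F$ and $\|U\|_*$ in different situations. (1) Full sharing, $U = [u\ u\ \ldots\ u]^T$, $\exists n', \forall l, n' \in \bar{\mathcal{A}}(l)$. (2) No sharing, $\forall l \ne l'$, $\bar{\mathcal{A}}(l) \cap \bar{\mathcal{A}}(l') = \emptyset$. (3) Disjoint feature, $U = [u_1\ u_2\ \ldots\ u_Y]^T$, $\forall l_1 \ne l_2$, $\mathrm{Supp}(u_{l_1}) \cap \mathrm{Supp}(u_{l_2}) = \emptyset$. (4) Factor scaling, $U = [a_1 u\ a_2 u\ \ldots\ a_Y u]^T$. Unlike the trace norm, the Shared Frobenius norm reduces to the Frobenius norm if no sharing is beneficial, as in the case of (3) disjoint feature. See the text and Appendix B for details.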
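The "Full sharing" row can be illustrated with a small numerical sketch. Any feasible factorization $AV = U$ gives an upper bound $\|A\|_{2 \to \infty} \|V\|_F$ on the Shared Frobenius norm of (17); the code below evaluates that bound for one particular factorization and compares it with the Frobenius norm. The matrices and helper names are hypothetical, and the code does not compute the minimum in (17) itself.

```python
import math


def matmul(A, V):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * v for a, v in zip(row, col)) for col in zip(*V)]
            for row in A]


def max_row_norm(A):
    """||A||_{2->inf}: the largest Euclidean norm among the rows of A."""
    return max(math.sqrt(sum(a * a for a in row)) for row in A)


def frobenius(V):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in V for v in row))


u = [3.0, 4.0]
Y = 4
U = [list(u) for _ in range(Y)]      # full sharing: every label uses u
A = [[1.0] for _ in range(Y)]        # one shared factor with coefficient 1
V = [list(u)]                        # the shared factor itself
shared_bound = max_row_norm(A) * frobenius(V)   # upper bound on ||U||_s
```

Here the bound equals $\|u\|_2 = 5$, while $\|U\|_F = \sqrt{Y}\|u\|_2 = 10$, matching the gap between the first and third columns of the table.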


7. Optimization 

In this section, we discuss the details of optimizing objectives (10) and (14). Specifically, we show how to obtain the most violating label for multi-label problems for objective (10), and an efficient algorithm to optimize objective (14).

7.1. Calculating the most violating label for multi-label problems

We optimize our training objective (10) using SGD (Shalev-Shwartz et al., 2007). In order to do so, the most challenging part is calculating

$$\hat{y}_i = \arg\max_{y \in \mathcal{Y}} \sum_{n \in y} \sqrt{\alpha_n} W_n x_i - \sum_{n \in y_i} \sqrt{\alpha_n} W_n x_i + \tilde{\Delta}(y, y_i) = \arg\max_{y \in \mathcal{Y}} L_i(y) \qquad (18)$$

at each iteration. For single-label problems, we can calculate $\hat{y}_i$ by enumerating all the labels. However, for a multi-label problem, this is intractable because of the exponential size of the label set. Therefore, in this subsection, we describe how to calculate $\hat{y}_i$ for multi-label problems.

If $L_i(y)$ decomposes as a sum of functions with respect to its nodes, i.e., $L_i(y) = \sum_n L_{i,n}(1\{n \in y\})$, then $\hat{y}_i$ can be found efficiently. Unfortunately, $\tilde{\Delta}(y, y_i)$ does not decompose with respect to the nodes. In order to allow efficient computation for multi-label problems, we replace $\tilde{\Delta}(y, y_i)$ with a decomposing approximation $\Delta^{\mathcal{L}}(y, y_i) = |(y - y_i) \cap \mathcal{L}| + |(y_i - y) \cap \mathcal{L}| = \sum_{l \in \mathcal{L}} (1_{\{l \in y - y_i\}} - 1_{\{l \in y \cap y_i\}}) + |y_i \cap \mathcal{L}|$ instead. When $\Delta^{\mathcal{L}}(y, y_i)$ is used and the graph $\mathcal{G}$ is a tree, $\hat{y}_i$ can be computed in time $O(M)$ using dynamic programming.
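One possible form of such a dynamic program is sketched below; the paper does not spell out its recursion, so this is an assumption-laden illustration rather than the authors' implementation. It assumes a tree given as a hypothetical `children` dictionary with 0-indexed nodes, and a precomputed list `score` whose entry for node $n$ is the amount that including $n$ adds to the decomposed objective.

```python
def most_violating_label(children, score, root=0):
    """Maximize the sum of score[n] over labels that contain the root,
    are closed under ancestors, and keep at least one child of every
    included internal node. Runs in time linear in the number of nodes."""
    best, chosen = {}, {}
    order, stack = [], [root]
    while stack:                          # pre-order: parents before children
        n = stack.pop()
        order.append(n)
        stack.extend(children.get(n, []))
    for n in reversed(order):             # bottom-up pass
        kids = children.get(n, [])
        if not kids:
            best[n], chosen[n] = score[n], []
            continue
        keep = [c for c in kids if best[c] > 0]
        if not keep:                      # at least one child must be kept
            keep = [max(kids, key=lambda c: best[c])]
        best[n] = score[n] + sum(best[c] for c in keep)
        chosen[n] = keep
    label, stack = set(), [root]
    while stack:                          # top-down pass recovers the label
        n = stack.pop()
        label.add(n)
        stack.extend(chosen[n])
    return best[root], label


# Hypothetical tree: root 0 with children 1 and 2; node 1 has leaves 3 and 4.
children = {0: [1, 2], 1: [3, 4]}
score = [0.5, -1.0, 0.2, 0.7, -0.3]
value, label = most_violating_label(children, score)
```

Each subtree is solved once and its value reused by its parent, which is what keeps the cost linear; the same recursion does not carry over to DAGs, where a node can be reached through several parents.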

When the graph $\mathcal{G}$ is a DAG, dynamic programming is not applicable. However, finding (18) in a DAG structure can be formulated as the following integer program:

$$\begin{aligned} \hat{z} = \arg\max_{z \in \{0,1\}^M} \;& \sum_{n=1}^{M} z_n \cdot r_n \\ \text{s.t.} \;& \sum_{c \in \mathcal{C}(n)} z_c \ge z_n, && \forall n \notin \mathcal{L} \\ & z_c \le z_n, && \forall n, \forall c \in \mathcal{C}(n) \\ & \sum_{l \in \mathcal{L}} z_l \ge 1 \end{aligned} \qquad (19)$$

where $r_n = \sqrt{\alpha_n} W_n x_i + 1_{\{n \notin y_i, n \in \mathcal{L}\}} - 1_{\{n \in y_i, n \in \mathcal{L}\}}$. The feasible labels from (19) are the sets of nodes where, if a node $n$ is in the label $y$, at least one of its child nodes is in $y$, i.e., $\forall n \notin \mathcal{L}, n \in y \implies \exists c \in \mathcal{C}(n), c \in y$, and all the parents of $n$ are in the label, i.e., $\forall n \in y \implies \forall p \in \mathcal{P}(n), p \in y$. The feasible set is equivalent to $\mathcal{Y}_m$. The search problem (19) can be shown to be NP-hard by reduction from the set cover problem. We relax the integer program into a linear program for training. The last constraint, $\sum_{l \in \mathcal{L}} z_l \ge 1$, is not needed for the integer program, but yields a tighter LP relaxation. In testing, we rely on binary integer programming only if the solution to the LP is not integral. In practice, the integer programming solver is effective for this problem, only 3 to 7 times slower than the relaxed linear program using the Gurobi solver (Gurobi Optimization, 2015).

7.2. Optimizing with the shared norm

Optimization (14) is a convex optimization jointly in $U$ and $\alpha$, and thus has a global optimum. For the proof, see Appendix C.

Lemma 1. Optimization (14) is a convex optimization jointly in $U$ and $\alpha$.

Since it is not clear how to jointly optimize efficiently with respect to $U$ and $\alpha$, we present an efficient method to optimize (14) by alternating between $\alpha$ and $V_n = U_n / \sqrt{\alpha_n}$. Specifically, we show how to calculate the optimal $\alpha$ for a fixed $U$ in closed form in time $O(M)$ when $\mathcal{G}$ is a tree, where $M$ is the number of nodes in the graph, and for fixed $\alpha$, we optimize the objective using SGD (Shalev-Shwartz et al., 2007) with the change of variables $V_n = U_n / \sqrt{\alpha_n}$. For a fixed $U$, the optimal $\alpha$ solves

$$\begin{aligned} \min_{\alpha \in \mathbb{R}^M} \;& \sum_{n \in [M]} \frac{\|U_n\|_2^2}{\alpha_n} \\ \text{s.t.} \;& \sum_{n \in \bar{\mathcal{A}}(y)} \alpha_n \le 1, && \forall y \in \mathcal{Y} \\ & \alpha_n \ge 0, && \forall n \in [M] \end{aligned} \qquad (20)$$

Algorithm 1 shows how to calculate the optimum $\alpha$ in (20) in time $O(M)$ for a tree structure. See Appendix D.

Lemma 2. For a tree structure $\mathcal{G}$, Algorithm 1 finds the optimal $\alpha$ in (20) in $O(M)$ in closed form.

Algorithm 1 Calculate the optimal $\alpha$ in (20) for the tree structure $\mathcal{G}$ in $O(M)$. We assume nodes are sorted in increasing order of depth, i.e., $\forall n, p \in \mathcal{P}(n), n > p$.
  Input: $U \in \mathbb{R}^{M \times d}$, a tree graph $\mathcal{G}$
  Output: $\alpha \in \mathbb{R}^M$
  Initialize: $\alpha = N = E = [0\ 0\ \ldots\ 0] \in \mathbb{R}^M$, $L = [1\ 1\ \ldots\ 1] \in \mathbb{R}^M$
  for $n = M \to 1$ do
    $N_n \leftarrow \|U_{n,\cdot}\|_2^2$, $E_n \leftarrow 0$
    if $\mathcal{C}(n) \ne \emptyset$ then
      $S \leftarrow \sum_{c \in \mathcal{C}(n)} N_c$
      if $\sqrt{S} + \sqrt{N_n} \ne 0$ then
        $N_n \leftarrow (\sqrt{N_n} + \sqrt{S})^2$, $E_n \leftarrow \frac{[\ldots]}{\sqrt{S} + \sqrt{N_n}}$
      end if
    end if
  end for
  $L_1 \leftarrow 1 - E_1$
  for $n = 2 \to M$ do
    $p \leftarrow \mathcal{P}(n)$
    $\alpha_n \leftarrow L_p \cdot E_n$
    $L_n \leftarrow L_p \cdot (1 - E_n)$
  end for
  return $\alpha$

In the experiments, we optimize $\alpha$ using Algorithm 1 with $U_n = \sqrt{\alpha_n} V_n$ after a fixed number of epochs of SGD with respect to $V$, and repeat this until the objective function converges. We find that the algorithm is efficient enough to scale up to large datasets.
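To illustrate why a closed form exists for problem (20) on a tree, here is a sketch based on a derivation of our own rather than a transcription of Algorithm 1. If a subtree rooted at $n$ may spend a budget $B$ along each of its paths, its best cost has the form $N_n / B$: a leaf spends the whole budget, and an internal node with squared norm $u$ and children total $S = \sum_c N_c$ minimizes $u/a + S/(B - a)$ at $a = B\sqrt{u}/(\sqrt{u} + \sqrt{S})$, giving $N_n = (\sqrt{u} + \sqrt{S})^2$. The function names, the `children` dictionary, and the 0-indexed nodes are all assumptions.

```python
import math


def optimal_alpha(children, sq_norms, root=0):
    """Closed-form minimizer of sum_n sq_norms[n] / alpha[n] subject to
    every root-to-leaf path of alpha summing to at most 1 (tree only)."""
    M = len(sq_norms)
    agg = [0.0] * M       # N_n: cost constant of the subtree rooted at n
    share = [1.0] * M     # fraction of the incoming budget that node n keeps
    order, stack = [], [root]
    while stack:                          # pre-order: parents before children
        n = stack.pop()
        order.append(n)
        stack.extend(children.get(n, []))
    for n in reversed(order):             # bottom-up pass
        kids = children.get(n, [])
        if not kids:
            agg[n] = sq_norms[n]          # a leaf keeps its whole budget
            continue
        a = math.sqrt(sq_norms[n])
        b = math.sqrt(sum(agg[c] for c in kids))
        agg[n] = (a + b) ** 2
        share[n] = a / (a + b) if a + b > 0 else 0.0
    alpha, budget = [0.0] * M, [0.0] * M
    budget[root] = 1.0
    for n in order:                       # top-down pass
        alpha[n] = budget[n] * share[n]
        for c in children.get(n, []):
            budget[c] = budget[n] - alpha[n]
    return alpha


def objective(sq_norms, alpha):
    """Objective of (20); nodes with zero weight norm contribute nothing."""
    return sum(s / a for s, a in zip(sq_norms, alpha) if s > 0)


children = {0: [1, 2]}                    # hypothetical root with two leaves
sq_norms = [1.0, 4.0, 9.0]
alpha = optimal_alpha(children, sq_norms)
```

For this small tree the optimal value is $(1 + \sqrt{13})^2$, and both passes visit each node once, consistent with the $O(M)$ cost stated in Lemma 2.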

8. Experiments 

We present experiments on both synthetic and real datasets. In Section 8.1, we consider synthetic datasets with both balanced structures and unbalanced structures (i.e., when some leaves in the class hierarchy are much deeper than others). We use this to demonstrate empirically the vulnerability of un-normalized hierarchical SVM to structure imbalance, and how normalization solves this problem. In particular, we will see how un-normalized HSVM does not achieve any performance gains over "flat" learning (completely ignoring the structure), but our NHSVM model does leverage the structure and achieves much higher accuracy. Then, in Section 8.2, we compare our method to competing methods on mid-sized benchmark datasets, including ones with multiple labels per instance and with DAG-structured hierarchies. Finally, in Section 8.3 we demonstrate performance on the large-scale LSHTC competition data, showing significant gains over the previously best published results and over other recently suggested methods. Data statistics are summarized in Table 2.

              | $M$   | $d$  | $N$  | $|\mathcal{L}|$ | $\bar{L}$
Synthetic(B)  | 15    | 1K   | 8K   | 8     | 1
Synthetic(U)  | 19    | 1K   | 10K  | 11    | 1
IPC           | 553   | 228K | 75K  | 451   | 1
WIKI d5       | 1512  | 1000 | 41K  | 1218  | 1.1
ImageNet      | 1676  | 51K  | 100K | 1000  | 1
DMOZ10        | 17221 | 165K | 163K | 12294 | 1
WIKI          | 50312 | 346K | 456K | 36504 | 1.8

Table 2. Data statistics: $M$ is the number of nodes in the graph, $d$ is the dimension of the features, $N$ is the number of instances, $|\mathcal{L}|$ is the number of labels, and $\bar{L}$ is the average number of labels per instance. $\bar{L} = 1$ denotes a single-label dataset.

8.1. Synthetic Dataset 

In this subsection, we empirically demonstrate the bene¬ 
fit of the normalization with the intuitive hierarchical syn¬ 
thetic datasets. While even for a perfectly balanced struc¬ 
ture, we gain from the normalization, we show that the reg¬ 
ularization of the HSVM can suffer significantly from im¬ 
balance of the structure (i.e. when the depths of the leaves 
are very different). Notice that for a large structured dataset 
such as wikipedia dataset, the structure is very unbalanced. 

Balanced synthetic data is created as follows. A weight 
vector W n £ for each node n £ [2 3 — 1] in the com¬ 
plete balanced tree with depth 4 and an instance vector 
Xi £ R. d ,i £ [W],1V = 15000, for each instances are sam¬ 
pled from the standard multivariate normal distribution. In¬ 
stances are assigned to labels which have maximum poten¬ 
tial. To create the unbalanced synthetic data, we sample 
Xi £ from the multivariate normal distribution with 
d = 1, 000, i £ [iV], N = 10, 000, and normalize its norm 
to 1. We divide the space R' 1 with a random hyperplane re¬ 
cursively so that the divided spaces form an unbalanced bi¬ 
nary tree structure, a binary tree growing only in one direc¬ 
tion. Specifically, we divide the space into two spaces with 
a random hyperplane, which form two child spaces, and 
recursively divide only one of the child space with a ran- 
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Method 

IPC 

DMOZ 

WIKI d5 

Imagenet 

SSVM 

52.6±.069 :f 

45.5 

* 

* 

NHSVM 

52.2±.05 t 

45.5 

60±.87t 

8.0±.lt 

HSVM 

50.4±.09 

45.0 

58±1.1 

7.3±.16 

FlatSVM 

51.6±.08 

44.2 

57±1.3 

7.6±.08 


Method 

Balanced 

Unbalanced 

SSVM 

NHSVM 

HSVM 

FlatSVM 

63.4 ±.35 
63.3 ±.34t 
62.8±.39 
60±.24 

74.9±.4' t 

74.1±.2t 

68.4±.07 

68.5±.l 


Table 3. Accuracy on synthetic datasets, f shows that the the im¬ 
provements over FlatSVM and HSVM is statistically significant. 
J shows that the improvement over NHS VM is statistically signif¬ 
icant. 

dom hyperplane until the depth of the binary tree reaches 
10. Each x is assigned to leaf nodes if x falls into the cor¬ 
responding space. 

In both datasets, our proposed models, NHSVM and 
SSVM, are compared to HSVM (Cai and Hofmann, 2004), 
and flat SVM in the Table 3. For each experiments the 
different parameters are tested on the the holdout dataset. 
Fixed set of A is tested, A £ {10 —8 ,10 -7 ,..., 10 2 }. For 
NHSVM is tested with p = 2, and p = 1 in (10) and (13). 
Also p = 2 is tested with directional constraints. For both 
WIKI, T = 1.5 is used in (8). And each model with the 
parameters which had the best holdout error is trained with 
all the training data, and we report test errors. We repeated 
the test for 20 times, and report the mean and the standard 
deviations. Notice that HSVM fails to exploit the hierar¬ 
chical structure of the unbalanced dataset with the accu¬ 
racy less than flat model, whereas NHSVM achieves higher 
accuracy by 6% over flat model. The accuracy gain of 
NHSVM against HSVM for the balanced dataset, shows 
the advantage of (11) and normalized structured loss(9). 
For the unbalanced dataset, SSVM further achieves around 
1% higher accuracy compared to NHSVM learning the un¬ 
derlying structure from the data. For the balanced dataset, 
SSVM performs similar to NHSVM. 

8.2. Benchmark Datasets 

We show the benefit of our model on several real world 
benchmark datasets in different fields without restricting 
domain to the document classification, such as ImageNet 
in table 4. We followed same procedure described in sec¬ 
tion 8.1. Results show consistent improvements over our 
base models. NHSVM outperforms our base methods, 
and SSVM shows additional increases in the performance. 
DMOZ 2010 and WIKI-2011 are from FSHTC competi¬ 
tion. IPC 3 is a single label patent document dataset. DMOZ 
2010 is a single label web-page collection. WIKI-2011 is 
a multi-label dataset of wikipedia pages, depth is cut to 5 
(excluded labels with depth more than 5). ImageNet data 
(Russakovsky et ah, 2014) is a single label image data with 
SIFT BOW features from development kit 2010. WIKI and 


Table 4. Accuracy on benchmark datasets. * denotes that the
algorithm could not be applied due to the graphical structure
of the data.


Method                                 Accuracy
NHSVM                                  43.8
HSVM                                   41.2
HR-SVM* (Gopal and Yang, 2013)         41.79
FastXML** (Prabhu and Varma, 2014)     31.6
Competition Winner                     37.39


Table 5. Results on full WIKI. * The inference of HR-SVM relies
on another meta-learning method (Gopal and Yang, 2010) for
high accuracy. ** NHSVM is used to predict the number of labels
in the inference.

ImageNet have DAG structures, and the others have tree 
structures. 

8.3. Result on LSHTC Competition 

We also compared our methods on the competition
dataset of the Large Scale Hierarchical Text Classification
Challenge 2 4. We compared with the winner of the competition
as well as the best published method we are aware of
so far, HR-SVM (Gopal and Yang, 2013). We also added
comparisons with FastXML (Prabhu and Varma, 2014) on
the competition dataset. FastXML is a very fast ranking
method suitable for large datasets. Since FastXML predicts
a ranking over all labels rather than a list of labels, we
predicted with the same number of labels as NHSVM and
compared the results. In Table 5, we show the results on the full
competition dataset, WIKI-2011, and compare with currently
reported results. NHSVM was able to scale to the
WIKI-2011 dataset and achieved state-of-the-art results.
Only the 98,519 features that appear in the test set are used,
with tf-idf type weighting BM25 (Robertson and Zaragoza,
2009). On a computer with an Intel Xeon E5-2620 CPU,
optimization took around 1.5 weeks in MATLAB
without a warm start.
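
The step of cutting a full label ranking down to a label list of a given size can be sketched as below. This is an illustrative sketch only: `top_k_labels` and `truncate_rankings` are hypothetical helper names, and the scores and label counts are assumed to be available as plain Python sequences (one score row per test instance, one label count per instance taken from the NHSVM prediction).

```python
from typing import List, Sequence


def top_k_labels(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring labels, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def truncate_rankings(
    score_rows: Sequence[Sequence[float]],
    label_counts: Sequence[int],
) -> List[List[int]]:
    """Keep, for each instance, as many top-ranked labels as requested."""
    return [top_k_labels(row, k) for row, k in zip(score_rows, label_counts)]
```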

9. Summary 

In this paper we considered the problem of large-scale
hierarchical classification with a given, known hierarchy. Our
starting point was the hierarchical structured SVM of Cai and
Hofmann (2004), and we also considered extensions for


3 http://www.wipo.int/classifications/ipc/ 


4 http://lshtc.iit.demokritos.gr/ 
handling multi-label problems (where each instance could
be tagged with multiple labels from the hierarchy) and
label hierarchies given by DAGs, rather than rooted trees,
over the labels. Our main contribution was pointing out
a normalization problem with this framework, both in the
effective regularization for labels of different depths and
in the loss associated with paths of different lengths. We
suggested a practical correction and showed how it yields
significant improvement in prediction accuracy. In fact, we
demonstrated that on a variety of large-scale hierarchical
classification tasks, including the Large-scale Hierarchical
Text Classification Competition data, our Normalized
Hierarchical SVMs outperform all other relevant methods
we are aware of (that use the same data and can
be scaled to these dataset sizes). We also briefly discussed
connections with matrix factorization approaches to
multi-label classification and plan to investigate this direction
further in future research.
Appendix A. Invariance property of NHSVM

Theorem 1 (Invariance property of NHSVM). The decision boundary
of NHSVM with $\mathcal{G}$ is arbitrarily close to that of NHSVM with
the minimum graph $M(\mathcal{G})$ as $p$ in (6) approaches 1, $p > 1$.

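Proof. We prove this by showing that, for any $\mathcal{G}$, the variable $\alpha$ in (6) can
be reduced to one variable per set of duplicated nodes in
$\mathcal{G}$ using the optimality conditions, and that optimizations (6) and (10) are
equivalent to the corresponding optimizations for $M(\mathcal{G})$ by a change
of variables.

Assume there are no duplicated leaves; the proof can
be easily generalized to duplicated leaves by introducing an
additional constraint on $y$.

Let $\mathcal{F}(n')$ be a mapping from a node $n'$ in graph $M(\mathcal{G})$ to the
corresponding set of duplicated nodes in $\mathcal{G}$. Denote the set of nodes in
$\mathcal{G}$ as $\mathcal{N}$, the set of nodes in $M(\mathcal{G})$ as $\mathcal{N}'$, and the set of leaves
in $M(\mathcal{G})$ as $\mathcal{L}'$.

Consider (6) for $\mathcal{G}$. Note that (6) has a constraint that the sum of $\alpha_n$
be 1 for $n \in \{n \in A(l) \mid l \in \mathcal{L}\}$. By the definition of
duplicity, if two nodes $n_1$ and $n_2$ are duplicated nodes, they are
ancestors of the same set of leaves, and the term $\alpha_{n_1}$ appears
in the first constraints of (6) if and only if the term $\alpha_{n_2}$ appears; thus
all the duplicated nodes appear together.
Consider a change of variable for each $n' \in \mathcal{N}'$,

$$K_{n'} = \sum_{n \in \mathcal{F}(n')} \alpha_n. \tag{21}$$

Then (6) is a function of $K_{n'}$ and decomposes w.r.t. $K_{n'}$.
From the convexity of the function $x^p$ with $p > 1$, $x > 0$,
and Jensen's inequality,
$\left(\frac{1}{|\mathcal{F}(n')|} K_{n'}\right)^p \le \frac{1}{|\mathcal{F}(n')|} \sum_{n \in \mathcal{F}(n')} \alpha_n^p$,
so the minimum of (6) is attained when $\alpha_n = \frac{K_{n'}}{|\mathcal{F}(n')|}$ for all
$n \in \mathcal{F}(n')$. As $\epsilon$ approaches 0, where $\epsilon = p - 1 > 0$, the factor
$|\mathcal{F}(n')|^{-\epsilon}$ below approaches 1:

$$\sum_{n \in \mathcal{N}} \alpha_n^p = \sum_{n' \in \mathcal{N}'} |\mathcal{F}(n')| \left(\frac{K_{n'}}{|\mathcal{F}(n')|}\right)^p = \sum_{n' \in \mathcal{N}'} |\mathcal{F}(n')|^{-\epsilon} K_{n'}^p. \tag{22}$$

Plugging (22) and (21) into (6),

$$\min \sum_{n' \in \mathcal{N}'} K_{n'}^p \quad \text{s.t.} \quad \sum_{n' \in A(l')} K_{n'} = 1, \quad \forall l' \in \mathcal{L}'.$$

These formulations are the same as (6) for $M(\mathcal{G})$.

Thus, given $n'$, $\alpha_n = \frac{K_{n'}}{|\mathcal{F}(n')|}$ is fixed for all $n \in \mathcal{F}(n')$, and with
the same argument for $W_n$ in (10), a change of variables gives
$W'_{n'} = \sum_{n \in \mathcal{F}(n')} W_n$. Then (10) is a minimization w.r.t. $W'_{n'}$,
and the minimum is attained when $W_n = \frac{W'_{n'}}{|\mathcal{F}(n')|}$ for all $n \in \mathcal{F}(n')$.
Plugging this into (10), and writing $n'$ for the node of $M(\mathcal{G})$ with $n \in \mathcal{F}(n')$,

$$\lambda \sum_{n \in \mathcal{N}} \left\| \frac{W'_{n'}}{|\mathcal{F}(n')|} \right\|_2^2 + \sum_i \max_{y} \left[ \left( \sum_{n \in y} \sqrt{\frac{K_{n'}}{|\mathcal{F}(n')|}} \frac{W'_{n'}}{|\mathcal{F}(n')|} - \sum_{n \in y_i} \sqrt{\frac{K_{n'}}{|\mathcal{F}(n')|}} \frac{W'_{n'}}{|\mathcal{F}(n')|} \right) \cdot x_i + \Delta(y, y_i) \right]. \tag{23}$$

By substituting $W''_{n'} = \frac{W'_{n'}}{\sqrt{|\mathcal{F}(n')|}}$,

$$(23) = \lambda \sum_{n' \in \mathcal{N}'} \|W''_{n'}\|_2^2 + \sum_i \max_{y} \left[ \left( \sum_{n' \in y} \sqrt{K_{n'}}\, W''_{n'} - \sum_{n' \in y_i} \sqrt{K_{n'}}\, W''_{n'} \right) \cdot x_i + \Delta(y, y_i) \right].$$

Thus (10) and (6) for $\mathcal{G}$ are equivalent to those for $M(\mathcal{G})$; the two
solutions are equivalent up to a change of variables, and the decision
boundaries are the same.

□

Appendix B. Behavior of Shared Frobenius Norm

We first show a lower bound for $\|\cdot\|_s$ and $\|\cdot\|_{s,\mathcal{G}}$, which will be useful
for the later proofs.

Lemma 3. For $U \in \mathbb{R}^{Y \times d}$,

$$\|U\|_{s,\mathcal{G}} \ge \|U\|_s \ge \max_y \|U_y\|_2,$$

where $U_y$ is the $y$-th row vector of $U$.

Proof. Let $A \in \mathbb{R}^{Y \times M}$, $V \in \mathbb{R}^{M \times d}$ be the matrices which
attain the minimum in $\|U\|_s = \min_{AV=U,\ \|A\|_{2\to\infty} \le 1} \|V\|_F$. Since
$A_{r,\cdot} V_{\cdot,c} = U_{r,c}$, the Cauchy-Schwarz inequality gives
$|U_{r,c}| \le \|A_{r,\cdot}\|_2 \cdot \|V_{\cdot,c}\|_2 \le \|V_{\cdot,c}\|_2$. Squaring both sides and
summing over $c$ gives $\|U_{r,\cdot}\|_2^2 \le \|V\|_F^2 = \|U\|_s^2$, which holds for all $r$. □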
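
The Cauchy-Schwarz step in the proof of Lemma 3 holds for every factorization $U = AV$ whose rows of $A$ have norm at most 1, not only for the minimizer, so it can be sanity-checked numerically. The sketch below does this in plain Python; all function names are hypothetical, and the random factorizations are an assumption made purely for illustration (they are not the minimizers used in the lemma).

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))


def max_row_norm(m):
    """Largest Euclidean norm among the rows of a matrix."""
    return max(math.sqrt(sum(x * x for x in row)) for row in m)


def random_feasible_factorization(n_labels, n_nodes, dim, rng):
    """Draw A with unit-norm rows and an arbitrary V; return (A, V, U = AV)."""
    a = []
    for _ in range(n_labels):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_nodes)]
        norm = math.sqrt(sum(x * x for x in row))
        a.append([x / norm for x in row])
    v = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_nodes)]
    return a, v, matmul(a, v)
```

For any triple returned by `random_feasible_factorization`, `max_row_norm(u)` should not exceed `frobenius(v)`, which is the inequality the lemma applies at the optimum.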

The following are detailed descriptions of the entries in Table 1, with
sketches of the proofs.


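Full sharing. If all weights are the same for all classes, i.e.,
$U = [u\ u\ \ldots\ u]^T \in \mathbb{R}^{Y \times d}$ for $u \in \mathbb{R}^d$, and there exists a node
$n$ that is shared among all $y$, i.e., $\exists n, \forall l, n \in A(l)$, then
$\|U\|_{s,\mathcal{G}}^2 = \|u\|_2^2$, whereas $\|U\|_F^2 = \|U\|_*^2 = Y \cdot \|u\|_2^2$.

$\|U\|_{s,\mathcal{G}} = \|u\|_2$ can be shown with the matrices

$$A = [\mathbf{1}_{Y,1}\ \ \mathbf{0}_{Y,M-1}] \in \mathbb{R}^{Y \times M} \quad \text{and} \quad V = \begin{bmatrix} u^T \\ \mathbf{0}_{M-1,d} \end{bmatrix},$$

where $\mathbf{n}_{r,c} \in \mathbb{R}^{r \times c}$ is a matrix with all elements set to $n$. Then $U = AV$,
and the factorization attains the minimum of (17) since it
attains the lower bound from Lemma 3. $\|U\|_*^2 = Y \cdot \|u\|_2^2$ is
easily shown from the fact that $U$ is a rank-one matrix with
a singular value of $\sqrt{Y} \cdot \|u\|_2$.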


No sharing. If there is no shared node, i.e., $\forall l, l' \in [Y]$, $l \ne l'$,
$A(l) \cap A(l') = \emptyset$, then $\|U\|_{s,\mathcal{G}}^2 = \|U\|_F^2$.

To show this, let $A$ and $V$ be the matrices which attain the
minimum of (16). The $m$-th element of $A_y$ is zero for all $y$
except one, and $V_{m,d}$ is nonzero only for the one $y$ such that
$m \in y$. Therefore, (16) decomposes w.r.t. $A_y$ and $V_{y,d}$,
where $V_{y,d}$ is the $d$-th column vector of $V$ restricted to the
rows in $y$:

$$\min_{AV=U} \|V\|_F^2 = \sum_y \sum_d \min_{A_y V_{y,d} = U_{y,d}} \|V_{y,d}\|_2^2.$$

Given $\|A_y\|_2 = 1$,

$$\|V_{y,d}\|_2 \ge |A_y \cdot V_{y,d}| = |U_{y,d}|.$$

Let $A_y = V_{y,d} / \|V_{y,d}\|_2$, which attains the lower bound. Therefore

$$\min \|V\|_F^2 = \sum_y \sum_d |U_{y,d}|^2 = \|U\|_F^2.$$

V d 


Disjoint feature. If $U = [u_1\ u_2\ \ldots\ u_Y]^T \in \mathbb{R}^{Y \times d}$ with
$u_l \in \mathbb{R}^d$ for $l \in [Y]$, and the supports of the $u_y$ are all disjoint, i.e.,
$\forall y_1 \ne y_2$, $\mathrm{Supp}(u_{y_1}) \cap \mathrm{Supp}(u_{y_2}) = \emptyset$, then
$\|U\|_{s,\mathcal{G}}^2 = \|U\|_F^2 = \sum_y \|u_y\|_2^2$ and $\|U\|_*^2 = \left(\sum_y \|u_y\|_2\right)^2$.

For $\|\cdot\|_{s,\mathcal{G}}$ the argument is similar to no sharing: the factorization
decomposes w.r.t. each column $u$. For the trace norm, since
the singular values are invariant to permutations of rows and
columns, $U$ can be transformed to a block diagonal matrix
by permutations of rows and columns, and the singular values
decompose w.r.t. the block matrices, with corresponding
singular values $\|u_y\|_2$.
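
The trace-norm part of the disjoint feature case can be illustrated with a small sketch. Rows with disjoint supports are orthogonal, so $U U^T$ is diagonal and the singular values of $U$ are the row norms. The function names below are hypothetical, and the code checks orthogonality of the rows (which disjoint supports imply) rather than the supports themselves.

```python
import math


def gram(u):
    """Return U U^T for a matrix given as a list of rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in u] for r1 in u]


def norms_for_orthogonal_rows(u, tol=1e-12):
    """Squared Frobenius norm and squared trace norm of U.

    Valid only when the rows of U are mutually orthogonal, which is the case
    when their supports are pairwise disjoint.
    """
    g = gram(u)
    size = len(u)
    for i in range(size):
        for j in range(size):
            if i != j and abs(g[i][j]) > tol:
                raise ValueError("rows are not orthogonal")
    # U U^T is diagonal, so the singular values are the row norms.
    singular_values = [math.sqrt(g[i][i]) for i in range(size)]
    frobenius_sq = sum(s * s for s in singular_values)
    trace_norm_sq = sum(singular_values) ** 2
    return frobenius_sq, trace_norm_sq
```

For two rows with norms 5 and 13, this gives a squared Frobenius norm of 194 and a squared trace norm of 324, matching the two expressions in the statement.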


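Factor scaling. If $U = [a_1 u\ \ a_2 u\ \ldots\ a_Y u]^T \in \mathbb{R}^{Y \times d}$ for $l \in [Y]$,
$u \in \mathbb{R}^d$, then $\|U\|_{s,\mathcal{G}}^2 = \max_l a_l^2 \|u\|_2^2$ and
$\|U\|_*^2 = \|U\|_F^2 = \|a\|_2^2 \cdot \|u\|_2^2$.

The proof is similar to full sharing. For $\|\cdot\|_{s,\mathcal{G}}$,

$$A = \frac{1}{\max_l |a_l|} \left[ [a_1\ a_2\ \ldots\ a_Y]^T\ \ \mathbf{0}_{Y,M-1} \right] \quad \text{and} \quad V = \max_l |a_l| \begin{bmatrix} u^T \\ \mathbf{0}_{M-1,d} \end{bmatrix}$$

is a feasible solution which attains the lower bound in Lemma 3.
For the trace norm, the singular values can be easily computed
knowing that $U$ is a rank-one matrix.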
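
The factorization used in the factor scaling case can be written out explicitly, as in the sketch below (full sharing is the special case in which every $a_l = 1$). The helper names are hypothetical, and placing the shared node in column 0 of $A$ is an assumption made for illustration.

```python
import math


def factor_scaling_factorization(a, u, n_nodes):
    """Build A (Y x M) and V (M x d) for U = a u^T as in the proof sketch.

    Column 0 of A plays the role of the node shared by all labels; the
    remaining M - 1 columns of A and rows of V are zero.
    """
    scale = max(abs(x) for x in a)
    big_a = [[x / scale] + [0.0] * (n_nodes - 1) for x in a]
    big_v = [[scale * x for x in u]] + [[0.0] * len(u) for _ in range(n_nodes - 1)]
    return big_a, big_v


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))
```

With this construction every row of $A$ has norm at most 1, `frobenius(big_v)` equals $\max_l |a_l| \cdot \|u\|_2$, and the Frobenius norm of the product equals $\|a\|_2 \cdot \|u\|_2$, which for a rank-one matrix is also its trace norm.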


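Proof of Lemma 2 (stated in Appendix D). Let
$f(n, l) = \min_{\alpha} \sum_{m \in D(n)} \|U_m\|_2^2 / \alpha_m$, where the $\alpha$ along every path
from $n$ to a leaf below it sum to $l$, and $D(n)$ denotes the union of $\{n\}$ and
the descendant nodes of $n$. Since $\mathcal{G}$ has a tree structure, the following
recursive relationship holds:

$$f(n, l) = \begin{cases} \dfrac{\|U_n\|_2^2}{l} & \text{if } n \text{ is a leaf node}, \\[2ex] \min_{0 \le k \le 1} \dfrac{\|U_n\|_2^2}{lk} + \sum_{c \in C(n)} f(c, l(1-k)) & \text{otherwise}. \end{cases} \tag{24}$$

If $n$ is a parent node of leaf nodes,

$$f(n, l) = \min_{0 \le k \le 1} \frac{B_1}{lk} + \frac{B_2}{l(1-k)}, \tag{25}$$

where $C(n)$ denotes the set of children nodes of $n$, $B_1 = \|U_n\|_2^2$, and
$B_2 = \sum_{c \in C(n)} \|U_c\|_2^2$. This has a closed form solution,

$$f(n, l) = \frac{1}{l} \left( \sqrt{B_1} + \sqrt{B_2} \right)^2, \tag{26}$$

and the minimum is attained at $k = \frac{\sqrt{B_1}}{\sqrt{B_1} + \sqrt{B_2}}$. For the parent node $p$
of $n$, $f(p, l)$ also has the form of (26), since equation
(26) has the form of the leaf-node case and the recursive relationship
(24) holds. We continue this process until the root node $r$ is
reached, and $f(r, 1)$ is the optimum. The optimal $\alpha$ can be
calculated backward.

□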
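
The recursion (24)-(26) can be turned into a two-pass procedure, sketched below in plain Python. This is a reconstruction from the recursion only, under the assumption that each subtree can be summarized by a single constant $B(n)$ with $f(n, l) = B(n)/l$; it is not a transcription of the paper's Algorithm 1, and the names `optimal_alpha`, `children`, and `sq_norm` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def optimal_alpha(
    children: Dict[int, List[int]],
    sq_norm: Dict[int, float],
    root: int,
) -> Tuple[Dict[int, float], float]:
    """Closed-form alpha on a tree, following recursions (24)-(26).

    children[n] lists the children of node n (empty list for a leaf) and
    sq_norm[n] is the squared norm of U_n. Returns (alpha, optimum), where
    optimum is f(root, 1).
    """
    # Pre-order traversal: every parent appears before its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Bottom-up pass: b[n] is the constant with f(n, l) = b[n] / l.
    b, child_sum = {}, {}
    for node in reversed(order):
        child_sum[node] = sum(b[c] for c in children[node])
        if children[node]:
            b[node] = (math.sqrt(sq_norm[node]) + math.sqrt(child_sum[node])) ** 2
        else:
            b[node] = sq_norm[node]

    # Top-down pass: split each budget between a node and its subtree.
    alpha, budget = {}, {root: 1.0}
    for node in order:
        if children[node]:
            denom = math.sqrt(sq_norm[node]) + math.sqrt(child_sum[node])
            k = math.sqrt(sq_norm[node]) / denom if denom > 0 else 0.0
        else:
            k = 1.0  # a leaf uses all of its remaining budget
        alpha[node] = budget[node] * k
        for c in children[node]:
            budget[c] = budget[node] * (1.0 - k)
    return alpha, b[root]
```

Each node is visited a constant number of times, so the sketch runs in time linear in the number of nodes, in line with the $O(M)$ claim of Lemma 2.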


Appendix C. Convexity of Shared 
Frobenius Norm optimization 

Lemma 1. Optimization (14) is a convex optimization
jointly in $U$ and $\alpha$.


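Proof. Let $f(U, \alpha) = \sum_n \sum_d f_{n,d}(U_{n,d}, \alpha_n)$, where
$f_{n,d} = U_{n,d}^2 / \alpha_n$. The Hessian of each $f_{n,d}$ can be calculated
easily by differentiating twice. The Hessian is a positive-semidefinite
matrix for $\alpha_n > 0$, since if $\alpha_n > 0$,

$$\nabla^2 f_{n,d} = \begin{bmatrix} \dfrac{\partial^2 f_{n,d}}{(\partial U_{n,d})^2} & \dfrac{\partial^2 f_{n,d}}{\partial \alpha_n \partial U_{n,d}} \\[2ex] \dfrac{\partial^2 f_{n,d}}{\partial \alpha_n \partial U_{n,d}} & \dfrac{\partial^2 f_{n,d}}{(\partial \alpha_n)^2} \end{bmatrix} = \frac{2}{\alpha_n^3} \begin{bmatrix} \alpha_n \\ -U_{n,d} \end{bmatrix} \begin{bmatrix} \alpha_n \\ -U_{n,d} \end{bmatrix}^T \succeq 0,$$

and if $\alpha_n = 0$ we can assume $\|U_n\|_2^2 = 0$ by restricting the domain,
and take the Hessian to be a zero matrix. Thus $\sum_n \frac{\|U_n\|_2^2}{\alpha_n}$ is a convex
function jointly in $U_n$ and $\alpha_n$, and the lemma follows from the fact that the
rest of the objective function in (14) is convex in $U_n$. □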
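
The positive semidefiniteness of the Hessian of $u^2/\alpha$ can be checked numerically with a short sketch. The function names are hypothetical, and the 2x2 test used here (non-negative trace and determinant) is a standard criterion for symmetric matrices rather than anything taken from the paper.

```python
def hessian_quad_over_lin(u: float, a: float):
    """Hessian of f(u, a) = u**2 / a with respect to (u, a), for a > 0."""
    return [
        [2.0 / a, -2.0 * u / a**2],
        [-2.0 * u / a**2, 2.0 * u**2 / a**3],
    ]


def is_psd_2x2(h, tol=1e-9):
    """A symmetric 2x2 matrix is PSD iff its trace and determinant are >= 0."""
    trace = h[0][0] + h[1][1]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    scale = max(1.0, abs(h[0][0] * h[1][1]))
    return trace >= -tol and det >= -tol * scale
```

The determinant is zero in exact arithmetic, which reflects the rank-one form of the Hessian; the tolerance only absorbs floating-point rounding.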


Appendix D. Closed form optimization of $\alpha$

Lemma 2. For a tree structure $\mathcal{G}$, Algorithm 1 finds the optimal
$\alpha$ in (20) in $O(M)$ in a closed form.





















